- Embedding Generation: Generate embeddings with the Nomic embedding model served by Triton
- RAG Support: Retrieval-Augmented Generation with Elasticsearch vector stores
- Hybrid Retrieval: Query fusion retriever combining multiple data sources
- LLM Reranking: Rerank retrieved documents using an LLM
- MongoDB Integration: Store and retrieve prompts and configurations
- Health Checks: Monitor service status and component availability
- Async Support: Asynchronous operations for better performance
```
embedding-service/
├── app/
│   ├── app.py                      # Application entry point
│   ├── single_index.py             # Main FastAPI application
│   ├── triton_nomic_embedding.py   # Triton embedding client
│   └── main.py                     # (unused)
├── mongo/
│   └── mongodbservice.py           # MongoDB service and repositories
├── requirements.txt                # Python dependencies
├── .env.example                    # Environment variables template
└── README.md                       # This file
```
1. Clone the repository:

   ```bash
   cd /Users/vishan/PycharmProjects/Embedding-Service
   ```

2. Create a virtual environment:

   ```bash
   python -m venv .venv
   source .venv/bin/activate  # On macOS/Linux
   ```

3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Configure environment variables:

   ```bash
   cp .env.example .env  # Edit .env with your configuration
   ```
Edit the .env file with your settings:
```bash
# Server
PORT=8000
HOST=0.0.0.0

# LLM Configuration
LLM_MODEL_NAME=llama-2-13b-chat
LLM_HOST=http://localhost:8000/v1
LLM_API_KEY=your-api-key

# Embedding Model
EMBEDDING_MODEL_NAME=nomic-ai_nomic-embed-text-v1.5-ensemble
EMBEDDING_API_BASE=http://localhost:8000

# MongoDB
MONGO_HOST=localhost
MONGO_USERNAME=admin
MONGO_PASSWORD=password

# Elasticsearch
ES_HOST=localhost
ES_USER=elastic
ES_PASSWORD=password
SECURITY_REPORT_INDEX_NAME=security_reports
CVE_INDEX_NAME=cve_data

# Retrieval Settings
TOP_K_AFTER_RERANK=5
SIMILARITY_TOP_K=10
```

Run the application in any of the following ways:

```bash
cd app
python app.py
```

```bash
cd app
uvicorn single_index:app --host 0.0.0.0 --port 8000
```

```bash
cd app
python single_index.py
```

GET /

Returns service information and available endpoints.
Response:
```json
{
  "service": "Embedding Service API",
  "version": "1.0.0",
  "status": "running",
  "endpoints": {
    "health": "/health",
    "embeddings": "/v1/embeddings",
    "prompt": "/v1/prompt",
    "retrieve": "/v1/retrieve"
  }
}
```

GET /health

Check service health and component status.
Response:
```json
{
  "status": "healthy",
  "embedding_model": true,
  "vector_stores": true,
  "mongodb": true
}
```

POST /v1/embeddings

Generate embeddings for provided texts.
Request:
```json
{
  "texts": ["Hello world", "How are you?"]
}
```

Response:
```json
{
  "embeddings": [[0.1, 0.2, ...], [0.3, 0.4, ...]],
  "model": "nomic-ai_nomic-embed-text-v1.5-ensemble",
  "dimensions": 768
}
```

POST /v1/retrieve

Retrieve relevant documents without generating a response.
Request:
```json
{
  "query": "What are the security vulnerabilities?",
  "summary": "Optional summary"
}
```

Response:
```json
{
  "query": "What are the security vulnerabilities?",
  "documents": [
    {
      "page": 1,
      "file_path": "/path/to/file.pdf",
      "file_name": "security_report.pdf",
      "score": 0.95,
      "text": "Document text...",
      "type": "pdf",
      "others": {}
    }
  ],
  "count": 1,
  "has_context": true
}
```

POST /v1/prompt

Generate a RAG-enhanced prompt with retrieved context.
Request:
```json
{
  "query": "Explain the CVE-2023-1234",
  "summary": "Optional summary"
}
```

Response:
```json
{
  "response": "Context 1: ...\nContext 2: ...",
  "metadata_list": [...],
  "prompt": "Formatted prompt with context",
  "system_message": "System prompt",
  "has_context": true,
  "retrievers_list": ["security_reports", "cve_store"]
}
```

Using cURL:

```bash
# Health check
curl http://localhost:8000/health

# Generate embeddings
curl -X POST http://localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Hello world", "Test embedding"]}'

# Retrieve documents
curl -X POST http://localhost:8000/v1/retrieve \
  -H "Content-Type: application/json" \
  -d '{"query": "security vulnerabilities"}'
```

Using Python:

```python
import requests

# Generate embeddings
response = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"texts": ["Hello world", "Test embedding"]}
)
print(response.json())

# Retrieve documents
response = requests.post(
    "http://localhost:8000/v1/retrieve",
    json={"query": "security vulnerabilities"}
)
print(response.json())
```

The Triton embedding client (app/triton_nomic_embedding.py):

- Connects to Triton Inference Server
- Supports batch processing
- Handles base64 encoding for text inputs
- Applies L2 normalization and mean pooling
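The base64 encoding, mean pooling, and L2 normalization steps can be sketched in plain Python. This is a minimal illustration with made-up shapes and helper names, not the actual client code, which operates on Triton outputs:

```python
import base64
import math

def encode_text(text: str) -> bytes:
    # Texts are base64-encoded before being sent to Triton
    return base64.b64encode(text.encode("utf-8"))

def mean_pool(token_embeddings: list[list[float]]) -> list[float]:
    # Average the per-token vectors into a single sentence vector
    dim = len(token_embeddings[0])
    n = len(token_embeddings)
    return [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]

def l2_normalize(vec: list[float]) -> list[float]:
    # Scale the vector to unit length
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

tokens = [[1.0, 2.0], [3.0, 4.0]]   # two token vectors of dimension 2
embedding = l2_normalize(mean_pool(tokens))
```

After normalization the dot product of two embeddings is their cosine similarity, which is what makes the vectors directly comparable in the vector store.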
- Elasticsearch: Primary vector store for document retrieval
- Hybrid Retrieval: Combines multiple retrievers using reciprocal rank fusion
- LLM Reranking: Uses an LLM to rerank retrieved documents
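Reciprocal rank fusion merges ranked lists by summing `1/(k + rank)` for each document across retrievers. A minimal sketch (the constant `k = 60` is the common default; the document IDs are made up, and the real logic lives inside the query fusion retriever):

```python
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each inner list is one retriever's results, best first.
    # A document's fused score is the sum of 1/(k + rank) over
    # every list in which it appears; higher is better.
    scores: dict[str, float] = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["cve-1", "report-7", "cve-3"],      # e.g. hits from the CVE index
    ["report-7", "report-2", "cve-1"],   # e.g. hits from the security reports index
])
```

Documents that rank well in several retrievers float to the top even when no single retriever ranked them first, which is why fusion works well for combining lexical and vector search.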
The MongoDB layer (mongo/mongodbservice.py):

- Stores prompts and configurations
- Singleton pattern for connection pooling
- Automatic reconnection handling
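The singleton pattern ensures every caller shares one client object, and therefore one connection pool. A schematic sketch, with a plain object standing in for the real pymongo `MongoClient`:

```python
class MongoService:
    """Schematic singleton; the real service wraps a pymongo MongoClient."""

    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the underlying client exactly once; later
        # instantiations return the same object and reuse its pool.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.client = object()  # stand-in for MongoClient(...)
        return cls._instance

a = MongoService()
b = MongoService()
```

Because pymongo's `MongoClient` already pools connections internally, sharing one instance this way is cheaper than constructing a client per request.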
The service provides graceful degradation:
- If Elasticsearch is unavailable, embeddings-only mode is enabled
- If MongoDB is unavailable, uses default prompts
- All endpoints return proper HTTP status codes and error messages
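The health payload shown earlier can be assembled from per-component probes. A hedged sketch of the aggregation logic; the probe results are passed in as booleans here, and the "degraded" status value is an assumption for illustration:

```python
def health_report(embedding_ok: bool, es_ok: bool, mongo_ok: bool) -> dict:
    # Overall status is "healthy" only when every component probe
    # succeeds; otherwise the service keeps running with reduced
    # functionality and reports itself as degraded.
    components = {
        "embedding_model": embedding_ok,
        "vector_stores": es_ok,
        "mongodb": mongo_ok,
    }
    status = "healthy" if all(components.values()) else "degraded"
    return {"status": status, **components}

report = health_report(True, False, True)
```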
For development with auto-reload:

```bash
cd app
uvicorn single_index:app --reload --host 0.0.0.0 --port 8000
```

Debugging:

```bash
# Check Python syntax
python -m py_compile app/single_index.py

# Run with debug logging
LOG_LEVEL=DEBUG python app/app.py
```

For deployment with systemd, create /etc/systemd/system/embedding-service.service:
```ini
[Unit]
Description=Embedding Service API
After=network.target

[Service]
Type=simple
User=your-user
WorkingDirectory=/path/to/Embedding-Service/app
Environment="PATH=/path/to/.venv/bin"
ExecStart=/path/to/.venv/bin/python app.py
Restart=always

[Install]
WantedBy=multi-user.target
```

Then reload systemd and start the service:
```bash
sudo systemctl daemon-reload
sudo systemctl enable embedding-service
sudo systemctl start embedding-service
```

If you encounter import errors with MongoDB: the service uses `sys.path.append` to handle imports, so make sure you're running from the correct directory.

For connection issues:

- Verify Triton server is running and accessible
- Check Elasticsearch cluster status
- Verify MongoDB connection string
Performance tuning:

- Adjust `max_batch_size` in TritonNomicEmbedding for better throughput
- Tune `SIMILARITY_TOP_K` and `TOP_K_AFTER_RERANK` for retrieval quality
- Use connection pooling for MongoDB (already configured)
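The effect of `max_batch_size` can be illustrated with a simple client-side chunking helper. This is a hypothetical sketch of how a batched embedding call might split its input, not the actual TritonNomicEmbedding code:

```python
def batched(texts: list[str], max_batch_size: int = 32):
    # Split the input into chunks no larger than max_batch_size,
    # so each Triton request stays within the model's batch limit.
    for i in range(0, len(texts), max_batch_size):
        yield texts[i:i + max_batch_size]

# 70 texts with a batch limit of 32 yield three requests: 32, 32, and 6
batches = list(batched([f"doc {i}" for i in range(70)], max_batch_size=32))
```

Larger batches amortize per-request overhead, at the cost of higher latency per request and more memory on the inference server.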
A comprehensive FastAPI-based embedding service with RAG (Retrieval-Augmented Generation) capabilities, supporting Triton inference server, Elasticsearch vector stores, and MongoDB for prompt management.